Spatially explicit yield gap decomposition by management and policy variables using Bayesian geoadditive distributional efficiency approach

1 Introduction

We follow Silva (###) approach of calculating and decomposing yield gaps in agriculture. In this approach, yield gaps are decomposed into technology yield gaps, resource yield gaps and efficiency yield gaps.

We address two challenges and potential advancements that have been identified in the literature on this approach, (i) granular decomposition of yield gaps beyond technology, resource, and efficiency yield gaps to actually consider the contribution of management and policy relevant variables like early sowing, and weed management by decomposing the efficiency yield gaps, and contribution to reducing inputs through decomposition of the resource gaps, and (2) spatially granular decomposition of yield gaps by demonstrating in which location or grid is each of the management practices more relevant to close the yield gaps.

In this workbook we will a recent approach of estimating marginal impact on efficiency for each of the efficiency variables at spatially disaggregated level and demonstrating that this approach can allow one to explore the sources of inefficiency at a spatially disaggregated level.

This then allow one to decompose the efficiency yield gap into policy and action oriented components for which farmers and other stakeholders can attempt to make adjustments. We also follow Pross et al () in identify resource yield gaps and identifying which input usage can be reallocated. We do these decompositions at a spatially disaggregated level thereby allowing one to identify if policies related to say subsidies or input assistance can help in increasing yields or whether other management mechanisms (for example split application) can be more helpful in a particular location.

On understanding which aspects of the efficieny yield gaps should be prioritized where, we decompose these into genotype, environment, management and socioeconomics (GEMS).

2 Conventional Yield Gap Decomposition (following on Silva et al)

Code
# package names
packages <- c("ggplot2", "micEcon", "frontier", "dplyr", "tidyr", "knitr", "car", "RColorBrewer", "DT", "rio", "tidyr", "dsfa", "mgcv", "geodata", "sf", "mapview", "dplyr", "terra", "raster")

# install packages
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
    install.packages(packages[!installed_packages])
}

# load packages
invisible(lapply(packages, library, character.only = TRUE))
# install.packages("collapse", repos = "https://fastverse.r-universe.dev")

LDS_wheat_public_cleaned <- import("LDS_wheat_public_cleaned.csv")

# Using sfa function in frontier

sfa_f <- sfa(log(L.tonPerHectare) ~ Nperha + wc2.1_30s_elev + temp + precip | Sowing_Date_Early + Weedmanaged, data = LDS_wheat_public_cleaned)

# see parameter estimates
summary(sfa_f)
Efficiency Effects Frontier (see Battese & Coelli 1995)
Inefficiency decreases the endogenous variable (as in a production function)
The dependent variable is logged
Iterative ML estimation terminated after 20 iterations:
log likelihood values and parameters of two successive iterations
are within the tolerance limit

final maximum likelihood estimates
                       Estimate  Std. Error z value  Pr(>|z|)    
(Intercept)          3.3201e-02  2.7087e-01  0.1226  0.902446    
Nperha               1.7599e-03  8.8421e-05 19.9036 < 2.2e-16 ***
wc2.1_30s_elev      -6.4161e-04  1.4956e-04 -4.2900 1.787e-05 ***
temp                 4.0423e-02  1.0308e-02  3.9216 8.795e-05 ***
precip               4.0003e-05  1.2590e-05  3.1775  0.001486 ** 
Z_(Intercept)        3.0689e-01  4.4891e-02  6.8363 8.126e-12 ***
Z_Sowing_Date_Early -5.9834e-01  9.0190e-02 -6.6343 3.261e-11 ***
Z_Weedmanaged       -3.9400e-01  4.3002e-02 -9.1623 < 2.2e-16 ***
sigmaSq              1.6181e-01  1.4937e-02 10.8326 < 2.2e-16 ***
gamma                7.7626e-01  1.9322e-02 40.1743 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
log likelihood value: -769.1422 

cross-sectional data
total number of observations = 7647 

mean efficiency: 0.789714 
Code
# Efficiency scores

eff <- efficiencies(sfa_f, type = "battese")
eff <- as.data.frame(eff)
ggplot(data = eff) +
    geom_histogram(aes(x = efficiency), bins = 20) +
    # scale_x_continuous(breaks = scales::pretty_breaks(n = 5), expand = c(0, 0)) +
    labs(title = "Battese Efficiency Score", x = "Efficiency score") +
    theme_bw()

Code
# Yield gap
eff_yield_gap_perc <- 100 - (eff$efficiency * 100)
eff_yield_gap_perc <- as.data.frame(eff_yield_gap_perc)

ggplot(data = eff_yield_gap_perc) +
    geom_histogram(aes(x = eff_yield_gap_perc), bins = 20) +
    # scale_x_continuous(breaks = scales::pretty_breaks(n = 5), expand = c(0, 0)) +
    labs(title = "Efficiency yield gap", x = "Efficiency yield gap") +
    theme_bw()

Code
eff_d <- sfa_f$dataTable
eff_d <- as.data.frame(eff_d)
eff_d$ln_yield <- eff_d$`log(L.tonPerHectare)`
eff_d$yield_t_ha <- exp(eff_d$ln_yield)
# Efficiency yield: maximum yield
eff_d$max_eff_yield <- eff_d$yield_t_ha / eff$efficiency

ggplot(data = eff_d) +
    geom_histogram(aes(x = max_eff_yield), bins = 20) +
    labs(title = "Maximum efficiency yield", x = "Maximum efficiency yield") +
    theme_bw()

Code
# Yield gap in tons per ha
eff_d$yield_gap_t_ha <- eff_d$max_eff_yield - eff_d$yield_t_ha

ggplot(data = eff_d) +
    geom_histogram(aes(x = yield_gap_t_ha), bins = 20) +
    labs(title = "Efficiency yield gap(t/ha)", x = "Efficiency yield gap(t/ha)") +
    theme_bw()

Code
# Yield gap reduction due to early sowing in tons/ha
eff_d$early_sowing_yield_gap_redn <- eff_d$yield_gap_t_ha * eff$efficiency

ggplot(data = eff_d) +
    geom_histogram(aes(x = early_sowing_yield_gap_redn), bins = 20) +
    labs(title = "Early sowing effect (t/ha)", x = "Early sowing effect (t/ha)") +
    theme_bw()

Code
# Input savings due to early sowing in tons per ha

3 Geoadditive stochastic frontier analysis

3.1 Estimation using thin plate spline

Code
# LDSestim_gam=LDSestim %>% drop_na()


# Write formulae for parameters for the mean model, sigma model, and inefficiency model
# We use the original functional forms: identity for mean model, log sigma for sigma model and log ineff for the inefficiency

mu_formula <- log(L.tonPerHectare)~Nperha + wc2.1_30s_elev + s(O.largestPlotGPS.Longitude,
    O.largestPlotGPS.Latitude,
    by = Weedmanaged
)

sigma_v_formula <- ~ 1 + temp + precip

sigma_u_formula <- ~ 1 + s(O.largestPlotGPS.Longitude, O.largestPlotGPS.Latitude, by = Sowing_Date_Early) + s(O.largestPlotGPS.Longitude,
    O.largestPlotGPS.Latitude,
    by = Weedmanaged
)

s <- -1 # production function specification

# Fit model
# If using older versions of R you can use the following code
# model <- mgcv::gam(
#     formula = list(mu_formula, sigma_v_formula, sigma_u_formula),
#     data = LDS_wheat_public_cleaned, family = normhnorm(s = s), optimizer = c("efs")
# )

model_dsfa<-dsfa(formula=list(mu_formula, sigma_v_formula, sigma_u_formula),
                 data=LDS_wheat_public_cleaned,  family=comper(s=s, distr="normhnorm"), optimizer = c("efs"))
summary(model_dsfa)

Family: comper 
Link function: identity logshift logshift 

Formula:
log(L.tonPerHectare) ~ Nperha + wc2.1_30s_elev + s(O.largestPlotGPS.Longitude, 
    O.largestPlotGPS.Latitude, by = Weedmanaged)
~1 + temp + precip
~1 + s(O.largestPlotGPS.Longitude, O.largestPlotGPS.Latitude, 
    by = Sowing_Date_Early) + s(O.largestPlotGPS.Longitude, O.largestPlotGPS.Latitude, 
    by = Weedmanaged)

Parametric coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)     1.090e+00  2.008e-02  54.266  < 2e-16 ***
Nperha          1.315e-03  8.227e-05  15.987  < 2e-16 ***
wc2.1_30s_elev -7.576e-04  2.066e-04  -3.667 0.000245 ***
(Intercept).1   3.487e+00  1.331e+00   2.620 0.008802 ** 
temp.1         -2.054e-01  5.082e-02  -4.042  5.3e-05 ***
precip.1        7.829e-05  5.904e-05   1.326 0.184763    
(Intercept).2  -8.638e-01  2.758e-02 -31.316  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                                                                              edf
s(O.largestPlotGPS.Longitude,O.largestPlotGPS.Latitude):Weedmanaged         27.94
s.2(O.largestPlotGPS.Longitude,O.largestPlotGPS.Latitude):Sowing_Date_Early 24.25
s.2(O.largestPlotGPS.Longitude,O.largestPlotGPS.Latitude):Weedmanaged       21.76
                                                                            Ref.df
s(O.largestPlotGPS.Longitude,O.largestPlotGPS.Latitude):Weedmanaged          29.46
s.2(O.largestPlotGPS.Longitude,O.largestPlotGPS.Latitude):Sowing_Date_Early  27.61
s.2(O.largestPlotGPS.Longitude,O.largestPlotGPS.Latitude):Weedmanaged        26.09
                                                                            Chi.sq
s(O.largestPlotGPS.Longitude,O.largestPlotGPS.Latitude):Weedmanaged          617.4
s.2(O.largestPlotGPS.Longitude,O.largestPlotGPS.Latitude):Sowing_Date_Early  205.7
s.2(O.largestPlotGPS.Longitude,O.largestPlotGPS.Latitude):Weedmanaged        210.8
                                                                            p-value
s(O.largestPlotGPS.Longitude,O.largestPlotGPS.Latitude):Weedmanaged          <2e-16
s.2(O.largestPlotGPS.Longitude,O.largestPlotGPS.Latitude):Sowing_Date_Early  <2e-16
s.2(O.largestPlotGPS.Longitude,O.largestPlotGPS.Latitude):Weedmanaged        <2e-16
                                                                               
s(O.largestPlotGPS.Longitude,O.largestPlotGPS.Latitude):Weedmanaged         ***
s.2(O.largestPlotGPS.Longitude,O.largestPlotGPS.Latitude):Sowing_Date_Early ***
s.2(O.largestPlotGPS.Longitude,O.largestPlotGPS.Latitude):Weedmanaged       ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


-REML = 330.56  Scale est. = 1         n = 7647
Code
plot(model_dsfa, select = 1) # Estimated function

Code
plot(model_dsfa, select = 2) 

3.2 Efficiency calculations

Code
# fitted values
tau_hat=model_dsfa$fitted.values
tau_hat=as.data.frame(tau_hat)
names(tau_hat)[1:3]=c("mu","sigma_v","sigma_u")


#Manual calculation of efficiency: Battese
tau_hat$sigma_c<-sqrt((tau_hat$sigma_u^2*tau_hat$sigma_v^2)/(tau_hat$sigma_v^2+tau_hat$sigma_u^2))#(1/sigma_u^2+1/sigma_v^2)^(-1)

model_dsfa_data=model_dsfa$model
tau_hat$mu_c<- -(model_dsfa_data$`log(L.tonPerHectare)`-tau_hat$mu)/tau_hat$sigma_v^2*tau_hat$sigma_c^2 #object$family$s*mgcv::residuals.gam(object)/sigma_v^2*sigma_c^2
 
tau_hat$u<-exp(1/2*tau_hat$sigma_c^2-tau_hat$mu_c)*stats::pnorm(tau_hat$mu_c/tau_hat$sigma_c-tau_hat$sigma_c)/stats::pnorm(tau_hat$mu_c/tau_hat$sigma_c)


# Package based calculation
tau_hat$eff <- efficiency(model_dsfa, type = "battese")


# Jondrow efficiency measure
tau_hat$sigma_c_j<-sqrt((tau_hat$sigma_u^2*tau_hat$sigma_v^2)/(tau_hat$sigma_v^2+tau_hat$sigma_u^2))#(1/sigma_u^2+1/sigma_v^2)^(-1)



tau_hat$mu_c_j<--(model_dsfa_data$`log(L.tonPerHectare)`-tau_hat$mu)/tau_hat$sigma_v^2*tau_hat$sigma_c_j^2#object$family$s*mgcv::residuals.gam(object)/sigma_v^2*sigma_c^2

tau_hat$u_jondrow<-tau_hat$mu_c_j+tau_hat$sigma_c_j*stats::dnorm(tau_hat$mu_c_j/tau_hat$sigma_c_j)/stats::pnorm(tau_hat$mu_c_j/tau_hat$sigma_c_j)

tau_hat$eff_j <- efficiency(model_dsfa, type = "jondrow")

# CI_lower<-mu_c+stats::qnorm(lower)*sigma_c
# CI_upper<-mu_c+stats::qnorm(upper)*sigma_c
# CI_lower<-exp(-CI_upper)
# CI_upper<-exp(-CI_lower)

3.3 Geoadditive mapping of the efficiency scores, and efficiency yield gaps

Code
lon <- model_dsfa_data$O.largestPlotGPS.Longitude
lon <- as.data.frame(lon)
lat <- model_dsfa_data$O.largestPlotGPS.Latitude
lat <- as.data.frame(lat)


effdt_battese <- cbind(lon, lat, model_dsfa_data, tau_hat)

library(bamlss)
set.seed(111)
f <- u ~  s(lon, lat)

## estimate model.
b <- bamlss(f, data = effdt_battese)
AICc -11690.2 logPost 5882.886 logLik 5874.602 edf 29.358 eps 0.0460 iteration   1
AICc -11690.7 logPost 5883.059 logLik 5875.074 edf 29.576 eps 0.0013 iteration   2
AICc -11690.7 logPost 5883.054 logLik 5875.090 edf 29.592 eps 0.0000 iteration   3
AICc -11690.7 logPost 5883.054 logLik 5875.090 edf 29.592 eps 0.0000 iteration   3
elapsed time:  0.26sec
Starting the sampler...

|                    |   0% 20.23sec
|*                   |   5% 18.62sec  0.98sec
|**                  |  10% 18.54sec  2.06sec
|***                 |  15% 17.23sec  3.04sec
|****                |  20% 17.00sec  4.25sec
|*****               |  25% 16.26sec  5.42sec
|******              |  30% 16.01sec  6.86sec
|*******             |  35% 15.08sec  8.12sec
|********            |  40% 13.89sec  9.26sec
|*********           |  45% 12.75sec 10.43sec
|**********          |  50% 11.56sec 11.56sec
|***********         |  55% 10.41sec 12.72sec
|************        |  60%  9.35sec 14.03sec
|*************       |  65%  8.50sec 15.78sec
|**************      |  70%  7.50sec 17.50sec
|***************     |  75%  6.47sec 19.42sec
|****************    |  80%  5.21sec 20.84sec
|*****************   |  85%  3.95sec 22.39sec
|******************  |  90%  2.66sec 23.95sec
|******************* |  95%  1.33sec 25.20sec
|********************| 100%  0.00sec 26.67sec
Code
# Boundary map
India <- gadm(country = "IND", level = 2, path = "shp")
plot(India)

Code
India_aoi <- subset(India, India$NAME_1 == "Bihar" | India$NAME_2 %in% c("Ballia", "Chandauli", "Deoria", "Ghazipur", "Kushinagar", "Maharajganj", "Mau", "Siddharth Nagar", "Gorakhpur"))

plot(India_aoi)

Code
India_aoi_sf <- st_as_sf(India_aoi)

mapview(India_aoi_sf)
Code
India_aoi_sf_dis <- st_union(India_aoi_sf)


# Predict
elevationglobal_geodata <- elevation_global(0.5, path = "raster")

elevationglobal_geodata_aoi <- terra::crop(elevationglobal_geodata, India_aoi_sf_dis)

library(raster)
elevationglobal_geodata_aoi <- raster(elevationglobal_geodata_aoi)
plot(elevationglobal_geodata_aoi)

Code
pred <- SpatialPoints(elevationglobal_geodata_aoi)@coords
pred <- as.data.frame(pred)
names(pred)[1:2] <- c("lon", "lat")
# pred <- expand.grid(lon = seq(82, 89, length = 100),lat = seq(24,28, length = 100))

effdt_battese_hat <- predict(b, newdata = pred)

effdt_battese_hat <- as.data.frame(effdt_battese_hat)
pred_effdt_battese_hat <- cbind(pred, effdt_battese_hat)

pred_effdt_battese_hat$sigma <- NULL
library(terra)

myras <- rast(pred_effdt_battese_hat, type = "xyz")
plot(myras)

Code
library(raster)
myras2 <- raster(myras)

library(sf)
India_aoi_sf_dis_sp <- as_Spatial(India_aoi_sf_dis)
myras2 <- mask(myras2, India_aoi_sf_dis_sp)
plot(myras2, main = "Gridded efficiency scores")

Code
# Estimate the efficiency yield gap (%)

effic_yield_gap_perc <- 100 - myras2 * 100
plot(effic_yield_gap_perc, main = "Efficiency yield gap(%)")

Code
# Efficiency yield
## Krige yields over the area of intersect
## use the predicted yields and efficiency score
library(bamlss)
set.seed(111)
effdt_battese$L.tonPerHectare <- exp(effdt_battese$`log(L.tonPerHectare)`)
f_y <- L.tonPerHectare ~  s(lon, lat)

## estimate model.
b_y <- bamlss(f_y, data = effdt_battese)
AICc 17775.44 logPost -8914.16 logLik -8858.47 edf 29.131 eps 0.2437 iteration   1
AICc 17771.25 logPost -8914.52 logLik -8855.10 edf 30.394 eps 0.0217 iteration   2
AICc 17771.25 logPost -8914.53 logLik -8855.09 edf 30.406 eps 0.0002 iteration   3
AICc 17771.25 logPost -8914.53 logLik -8855.09 edf 30.406 eps 0.0000 iteration   4
AICc 17771.25 logPost -8914.53 logLik -8855.09 edf 30.406 eps 0.0000 iteration   4
elapsed time:  0.69sec
Starting the sampler...

|                    |   0% 40.46sec
|*                   |   5% 53.96sec  2.84sec
|**                  |  10% 55.71sec  6.19sec
|***                 |  15% 50.55sec  8.92sec
|****                |  20% 56.56sec 14.14sec
|*****               |  25% 51.00sec 17.00sec
|******              |  30% 45.50sec 19.50sec
|*******             |  35% 45.76sec 24.64sec
|********            |  40% 41.76sec 27.84sec
|*********           |  45% 36.94sec 30.22sec
|**********          |  50% 32.17sec 32.17sec
|***********         |  55% 30.59sec 37.39sec
|************        |  60% 26.58sec 39.87sec
|*************       |  65% 23.05sec 42.80sec
|**************      |  70% 18.91sec 44.12sec
|***************     |  75% 15.14sec 45.42sec
|****************    |  80% 12.16sec 48.65sec
|*****************   |  85%  8.83sec 50.01sec
|******************  |  90%  5.70sec 51.26sec
|******************* |  95%  2.76sec 52.53sec
|********************| 100%  0.00sec 53.78sec
Code
effdt_battese_hat_y <- predict(b_y, newdata = pred)

effdt_battese_hat_y <- as.data.frame(effdt_battese_hat_y)
pred_effdt_battese_hat_y <- cbind(pred, effdt_battese_hat_y)

pred_effdt_battese_hat_y$sigma <- NULL
library(terra)

myras_y <- rast(pred_effdt_battese_hat_y, type = "xyz")
plot(myras_y)

Code
library(raster)
myras2_y <- raster(myras_y)

library(sf)
India_aoi_sf_dis_sp <- as_Spatial(India_aoi_sf_dis)
myras2_y <- mask(myras2_y, India_aoi_sf_dis_sp)
plot(myras2_y, main = "Gridded wheat yields (t/ha)")

Code
# Maximum efficient yield
myras_max_eff_yield <- myras2_y / myras2

plot(myras_max_eff_yield, main = "Maximum efficient yield(t/ha)")

Code
# Predicted yields from the model
library(bamlss)
set.seed(111)
effdt_battese$exp_mu <- exp(effdt_battese$mu)

f_y_hat <- exp_mu ~  s(lon, lat)

## estimate model.
b_y_hat <- bamlss(f_y_hat, data = effdt_battese)
AICc 2921.270 logPost -1488.30 logLik -1430.01 edf 30.496 eps 0.1408 iteration   1
AICc 2731.054 logPost -1394.31 logLik -1334.50 edf 30.895 eps 0.0425 iteration   2
AICc 2728.682 logPost -1393.13 logLik -1333.29 edf 30.915 eps 0.0049 iteration   3
AICc 2728.681 logPost -1393.13 logLik -1333.29 edf 30.917 eps 0.0000 iteration   4
AICc 2728.681 logPost -1393.13 logLik -1333.29 edf 30.917 eps 0.0000 iteration   4
elapsed time:  0.61sec
Starting the sampler...

|                    |   0% 33.32sec
|*                   |   5% 36.10sec  1.90sec
|**                  |  10% 44.73sec  4.97sec
|***                 |  15% 36.83sec  6.50sec
|****                |  20% 37.24sec  9.31sec
|*****               |  25% 31.86sec 10.62sec
|******              |  30% 27.74sec 11.89sec
|*******             |  35% 27.65sec 14.89sec
|********            |  40% 25.14sec 16.76sec
|*********           |  45% 22.44sec 18.36sec
|**********          |  50% 19.64sec 19.64sec
|***********         |  55% 18.88sec 23.08sec
|************        |  60% 16.30sec 24.45sec
|*************       |  65% 14.19sec 26.36sec
|**************      |  70% 11.88sec 27.73sec
|***************     |  75%  9.69sec 29.08sec
|****************    |  80%  8.11sec 32.44sec
|*****************   |  85%  6.04sec 34.22sec
|******************  |  90%  3.99sec 35.92sec
|******************* |  95%  1.96sec 37.23sec
|********************| 100%  0.00sec 38.56sec
Code
effdt_battese_hat_mu_y <- predict(b_y_hat, newdata = pred)

effdt_battese_hat_mu_y <- as.data.frame(effdt_battese_hat_mu_y)
pred_effdt_battese_hat_mu_y <- cbind(pred, effdt_battese_hat_mu_y)

pred_effdt_battese_hat_mu_y$sigma <- NULL
library(terra)

myras_mu_y <- rast(pred_effdt_battese_hat_mu_y, type = "xyz")
plot(myras_mu_y)

Code
library(raster)
myras2_mu_y <- raster(myras_mu_y)

library(sf)
India_aoi_sf_dis_sp <- as_Spatial(India_aoi_sf_dis)
myras2_mu_y <- mask(myras2_mu_y, India_aoi_sf_dis_sp)
plot(myras2_mu_y, main = "Gridded predicted wheat yields (t/ha)")

4 Geospatial mapping of the efficiency parameters

Our goal here to show the spatial variation in the effect of a variable on the efficiency scores

Using formula from Pross et al (2018), technical efficiency for each grid (note that this is the same formula as the Battese formula for each farm) can be computed as:

\[ TE=E[exp(-u|\epsilon)]=\frac{exp(-u+0.5\sigma^2) \Phi (u/\sigma -\sigma)}{\Phi (u/\sigma)} \]

\[ TE=a\times b \times c \] Where \[ a=exp(-u+0.5\sigma^2) \]

\[ b= \Phi (u/\sigma -\sigma) \]

\[ c=[\Phi (u/\sigma)]^{-1} \]

\[ u=\frac{-\epsilon \sigma_u^2}{\sigma_u^2-\sigma_v^2} \]

\[ \sigma^2=\frac{\sigma_u^2 \sigma_v^2}{\sigma_u^2-\sigma_v^2} \]

\[ \sigma_u^2=(\sigma_u)^2 \alpha= (\sigma_u*)^2 exp(\eta^u)= (\sigma_u*)^2 exp(Z'\beta^u) \]

\[ \epsilon=y-\eta^y \]

The marginal effect of the explanatory variable on TE is given by:

\[ ME=\frac{\partial TE}{\partial z} \]

\[ ME=a\times (0.5 \times d-e)\times b \times c+ a\times \phi (u/\sigma -\sigma) (g-f)\times c - a \times b \times c^2 \times \phi \times (u/sigma) \times g \]

\[ d=\beta_k^u \times (\frac{\sigma^2}{\sigma_u^2}) \]

\[ e=-\epsilon \times \beta_k^u (\frac{\sigma_u^2 \sigma_v^2}{(\sigma_u^2 + \sigma_v^2})^2) \]

\[ f=0.5 \times \sigma_u \sigma_v \beta^u ((\sigma_u^2 +\sigma_v^2)^{-0.5}-(\sigma_u^2 +\sigma_v^2)^{-3/2} \sigma_u^2) \]

\[ g=\frac{\sigma^2 e -uf}{\sigma^2} \]

Code
names(pred)[1:2] <- c("O.largestPlotGPS.Longitude", "O.largestPlotGPS.Latitude")

pred$Sowing_Date_Early <- 1
pred$Nperha <- mean(model_dsfa_data$Nperha)
pred$Weedmanaged <- mean(model_dsfa_data$Weedmanaged)
pred$temp <- mean(model_dsfa_data$temp)
pred$precip <- mean(model_dsfa_data$precip)
pred$wc2.1_30s_elev <- mean(model_dsfa_data$wc2.1_30s_elev)

sowingdate_effect_on_inefficiency <- predict.gam(model_dsfa, newdata = pred, model = "sigma_u", term = "s.2(O.largestPlotGPS.Longitude,O.largestPlotGPS.Latitude):Sowing_Date_Early")

sowingdate_effect_on_inefficiency <- as.data.frame(sowingdate_effect_on_inefficiency)
sowingdate_effect_on_inefficiency <- cbind(pred[, c("O.largestPlotGPS.Longitude", "O.largestPlotGPS.Latitude")], sowingdate_effect_on_inefficiency)

sowingdate_effect_on_inefficiency$V1 <- NULL
sowingdate_effect_on_inefficiency$V2 <- NULL

sowingdate_effect_on_inefficiency <- rename(sowingdate_effect_on_inefficiency, Early_Sowing = V3)

sow_ras <- rast(sowingdate_effect_on_inefficiency, type = "xyz")
plot(sow_ras)

Code
sow_raster <- raster(sow_ras)
sow_raster <- mask(sow_raster, India_aoi_sf_dis_sp)
plot(sow_raster, main = "Spatially differentiated effect of early sowing on mu")

The contributions of the different practices to reducing the yield gaps can be expressed in the units of yields, i.e., tons/ha. With this we can then compare which of the practices deliver more yield gap reduction thereby prioritizing which of the practices should be targeted where. We use the marginal efficiency formula that has been separately derived by Liu and Myers (2008), Olsen and Henningsen (2011) and Pross et al (2018).

5 Impact of z on TE

5.1 Non-spatial model

Code
## Effect of sowing date and weed management practices on inefficiency

te_score_sowing <- frontier::efficiencies(sfa_f, margEff = TRUE)
Z_sowing <- attributes(te_score_sowing)
Z_sowing_dt <- as.data.frame(Z_sowing$margEff)

ggplot(data = Z_sowing_dt) +
    geom_histogram(aes(x = efficiency.Z_Sowing_Date_Early), bins = 20) +
    # scale_x_continuous(breaks = scales::pretty_breaks(n = 5), expand = c(0, 0)) +
    labs(title = "Z sowing marginal efficiency", x = "Marginal efficiency") +
    theme_bw()

Code
summary(Z_sowing_dt$efficiency.Z_Sowing_Date_Early)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.006794 0.039267 0.061676 0.057467 0.077780 0.084606 
Code
## Weed management

ggplot(data = Z_sowing_dt) +
    geom_histogram(aes(x = efficiency.Z_Weedmanaged), , bins = 20) +
    # scale_x_continuous(breaks = scales::pretty_breaks(n = 5), expand = c(0, 0)) +
    labs(title = "Z weeding marginal efficiency", x = "Marginal efficiency") +
    theme_bw()

Code
summary(Z_sowing_dt$efficiency.Z_Weedmanaged)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.004474 0.025856 0.040612 0.037840 0.051216 0.055711 

Early sowing of wheat increases efficiency by 5.6 percentage points while weeding improves efficiency by about 3.7 percentage points.

Translating this in the context of yield gaps it means that early sowing reduces efficiency yield gap by the same percentage points say from 50 percent to 45 percent. In terms of the quantities, one can also take the same percentage point contribution.

We also note that the partial effect on efficiency is the same as the partial effect on yield. That means that the percentage point increase in efficiency due to early sowing or other management variable will also increase yields by the same amount. This applies in the linear models (i.e., where the variable enters the inefficiency model linearly) but in the case of the non-linear models, one needs a variant of the formula to account for the fact that the partial effect may vary between the efficiency and the yields.

5.2 Spatial model

Code
tau_hat$sigma_c <- sqrt((tau_hat$sigma_u^2 * tau_hat$sigma_v^2) / (tau_hat$sigma_v^2 + tau_hat$sigma_u^2)) # (1/sigma_u^2+1/sigma_v^2)^(-1)

model_dsfa_data <- model_dsfa$model

tau_hat$mu_c <- -(model_dsfa_data$`log(L.tonPerHectare)` - tau_hat$mu) / tau_hat$sigma_v^2 * tau_hat$sigma_c^2 # object$family$s*mgcv::residuals.gam(object)/sigma_v^2*sigma_c^2

# tau_hat$epsilon_gam <- model_dsfa$family$s * mgcv::residuals.gam(model_dsfa)
# tau_hat$mu_cc <- -(tau_hat$epsilon_gam) / tau_hat$sigma_v^2 * tau_hat$sigma_c^2 # object$family$s*mgcv::residuals.gam(object)/sigma_v^2*sigma_c^2

# Efficency estimates based on the formula
tau_hat$u <- exp(1 / 2 * tau_hat$sigma_c^2 - tau_hat$mu_c) * stats::pnorm(tau_hat$mu_c / tau_hat$sigma_c - tau_hat$sigma_c) / stats::pnorm(tau_hat$mu_c / tau_hat$sigma_c) # Impact of Z variables on efficiency yield gap [1-TE]


# Pross formula simplification
tau_hat$a <- exp(1 / 2 * tau_hat$sigma_c^2 - tau_hat$mu_c)
tau_hat$b <- stats::pnorm(tau_hat$mu_c / tau_hat$sigma_c - tau_hat$sigma_c)
tau_hat$c <- 1 / (stats::pnorm(tau_hat$mu_c / tau_hat$sigma_c))

tau_hat$u_pross_formula <- tau_hat$a * tau_hat$b * tau_hat$c
summary(tau_hat$u_pross_formula)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1022  0.7345  0.8276  0.7947  0.8829  0.9768 
Code
# Predict the effect on inefficiency term (eta or mu)

sowing_trted_dt <- model_dsfa$model
sowing_trted_dt$Sowing_Date_Early <- 1


sowingdate_effect_on_inefficiency_obs <- predict.gam(model_dsfa, newdata = sowing_trted_dt, model = "sigma_u", term = "s.2(O.largestPlotGPS.Longitude,O.largestPlotGPS.Latitude):Sowing_Date_Early")

sowingdate_effect_on_inefficiency_obs <- as.data.frame(sowingdate_effect_on_inefficiency_obs)

sowingdate_effect_on_inefficiency_obs$V1 <- NULL
sowingdate_effect_on_inefficiency_obs$V2 <- NULL

sowingdate_effect_on_inefficiency_obs <- rename(sowingdate_effect_on_inefficiency_obs, sowing_date_early = V3)




# ME of Z

tau_hat$d <- sowingdate_effect_on_inefficiency_obs$sowing_date_early * (tau_hat$sigma_c^2 / tau_hat$sigma_u^2)

tau_hat$e <- -(model_dsfa_data$`log(L.tonPerHectare)` - tau_hat$mu) * sowingdate_effect_on_inefficiency_obs$sowing_date_early * (tau_hat$sigma_v^2 * tau_hat$sigma_u^2) / ((tau_hat$sigma_v^2 + tau_hat$sigma_u^2)^2)

tau_hat$f <- 0.5 * tau_hat$sigma_u * tau_hat$sigma_v * sowingdate_effect_on_inefficiency_obs$sowing_date_early * (((tau_hat$sigma_u^2 + tau_hat$sigma_v^2)^(-0.5)) - (((tau_hat$sigma_u + tau_hat$sigma_v)^(-1.5)) * (tau_hat$sigma_u^2)))

tau_hat$g <- ((tau_hat$sigma_c * tau_hat$e) - (tau_hat$mu_c * tau_hat$f)) / ((tau_hat$sigma_c)^2)

tau_hat$sowing_effect_on_TE1 <- tau_hat$a * (0.5 * tau_hat$d - tau_hat$e) * tau_hat$b * tau_hat$c

tau_hat$density_mu_c_sig_c <- stats::dnorm((tau_hat$mu_c / tau_hat$sigma_c) - tau_hat$sigma_c)

tau_hat$sowing_effect_on_TE2 <- tau_hat$a * tau_hat$density_mu_c_sig_c * (tau_hat$g - tau_hat$f) * tau_hat$sigma_c

tau_hat$density_mu_c_sig_c2 <- stats::dnorm((tau_hat$mu_c / tau_hat$sigma_c))

tau_hat$sowing_effect_on_TE3 <- tau_hat$a * tau_hat$b * ((tau_hat$c)^2) * tau_hat$density_mu_c_sig_c2 * tau_hat$g


tau_hat$sowing_effect_on_TE_final <- tau_hat$sowing_effect_on_TE1 + tau_hat$sowing_effect_on_TE2 - tau_hat$sowing_effect_on_TE3

tau_hat$sowing_effect_on_TE_final <- as.numeric(tau_hat$sowing_effect_on_TE_final)
summary(tau_hat$sowing_effect_on_TE_final)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-1.588755 -0.071266 -0.031679 -0.056670 -0.007815  2.606674 
Code
summary(tau_hat$sowing_effect_on_TE_final)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-1.588755 -0.071266 -0.031679 -0.056670 -0.007815  2.606674 
Code
# Mapping the results 
names(pred)[1:2] <- c("lon", "lat")
effdt_battese2 <- cbind(lon, lat, tau_hat)

effdt_battese2_sp <- SpatialPointsDataFrame(cbind(effdt_battese2$lon, effdt_battese2$lat), data = effdt_battese2, proj4string = CRS("+proj=longlat +datum=WGS84"))

library(tmap)
tmap_mode("view")
tm_shape(effdt_battese2_sp) +
    tm_dots(col = "sowing_effect_on_TE_final", title = "Impact of early sowing on inefficiency", style = "quantile") +
    tm_layout(legend.outside = TRUE)
Code
# Mapping the estimates: NOT WORKING PROPERLY
f_y_hat_TE <- sowing_effect_on_TE_final ~  s(lon, lat)

## estimate model.
b_y_hat_TE <- bamlss(f_y_hat_TE, data = effdt_battese2)
AICc -11170.8 logPost 5607.325 logLik 5615.827 edf 30.255 eps 1.7954 iteration   1
AICc -11180.9 logPost 5612.284 logLik 5621.030 edf 30.442 eps 0.0179 iteration   2
AICc -11180.9 logPost 5612.280 logLik 5621.062 edf 30.469 eps 0.0021 iteration   3
AICc -11180.9 logPost 5612.280 logLik 5621.063 edf 30.470 eps 0.0000 iteration   4
AICc -11180.9 logPost 5612.280 logLik 5621.063 edf 30.470 eps 0.0000 iteration   4
elapsed time:  0.47sec
Starting the sampler...

|                    |   0% 26.18sec
|*                   |   5% 26.98sec  1.42sec
|**                  |  10% 53.37sec  5.93sec
|***                 |  15% 53.66sec  9.47sec
|****                |  20% 50.32sec 12.58sec
|*****               |  25% 43.62sec 14.54sec
|******              |  30% 41.77sec 17.90sec
|*******             |  35% 37.11sec 19.98sec
|********            |  40% 33.01sec 22.01sec
|*********           |  45% 29.33sec 24.00sec
|**********          |  50% 27.73sec 27.73sec
|***********         |  55% 24.57sec 30.03sec
|************        |  60% 20.78sec 31.17sec
|*************       |  65% 18.04sec 33.50sec
|**************      |  70% 15.02sec 35.04sec
|***************     |  75% 12.16sec 36.48sec
|****************    |  80%  9.45sec 37.79sec
|*****************   |  85%  7.13sec 40.40sec
|******************  |  90%  4.69sec 42.22sec
|******************* |  95%  2.29sec 43.50sec
|********************| 100%  0.00sec 45.92sec
Code
effdt_battese_hat_mu_y_TE <- predict(b_y_hat_TE, newdata = pred)

effdt_battese_hat_mu_y_TE <- as.data.frame(effdt_battese_hat_mu_y_TE)
pred_effdt_battese_hat_mu_y_TE <- cbind(pred, effdt_battese_hat_mu_y_TE)

pred_effdt_battese_hat_mu_y_TE$sigma <- NULL
pred_effdt_battese_hat_mu_y_TE$Sowing_Date_Early <- NULL
pred_effdt_battese_hat_mu_y_TE$Nperha <- NULL
pred_effdt_battese_hat_mu_y_TE$Weedmanaged <- NULL
pred_effdt_battese_hat_mu_y_TE$temp <- NULL
pred_effdt_battese_hat_mu_y_TE$precip <- NULL
pred_effdt_battese_hat_mu_y_TE$wc2.1_30s_elev <- NULL

library(terra)

myras_mu_y_TE <- rast(pred_effdt_battese_hat_mu_y_TE, type = "xyz")
plot(myras_mu_y_TE)

Code
library(raster)
myras2_mu_y_TE <- raster(myras_mu_y_TE)

library(sf)
India_aoi_sf_dis_sp <- as_Spatial(India_aoi_sf_dis)
myras2_mu_y_TE <- mask(myras2_mu_y_TE, India_aoi_sf_dis_sp)
plot(-1*myras2_mu_y_TE, main = "Impact of early sowing on efficiency")

6 Impact of Z variables on optimal input use [slack resources]

To understand by how much variables that improve efficiency can allow a farmer to produce the same amount of output but using less resources, we estimate the marginal effect of these variables on slack resources: that is the difference between optimal use of resources and the current levels of use of resources.

We focus on application on reducing N overuse.

Following Pross et al, the marginal effect of Z on the slack resources is: \[ \Delta X_k=X_k^* -X_k \]

\[ \Delta X_k = X_k (\frac{1}{(1+ME/TE)^{1/\beta_k}}-1) \]

7 How much input can be reduced to achieve same level of output due to increase in efficiency

8 Non spatial model

Code
coeffs <- coefficients(sfa_f)
coeffs <- as.data.frame(coeffs)

slack_sow_N_effect_non_sp <- eff_d$Nperha*(1/(1+(Z_sowing_dt$efficiency.Z_Sowing_Date_Early/eff$efficiency)^(1/0.0018))-1)

summary(slack_sow_N_effect_non_sp)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       0       0       0       0       0 
Code
slack_weed_N_effect_non_sp <- eff_d$Nperha*(1/(1+(Z_sowing_dt$efficiency.Z_Weedmanaged/eff$efficiency)^(1/0.0018))-1)

summary(slack_weed_N_effect_non_sp)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       0       0       0       0       0 

9 Spatial model

Code
slack_sow_N_effect_sp <- model_dsfa_data$Nperha*(1/(1+(-1*tau_hat$sowing_effect_on_TE_final/tau_hat$eff)^(1/0.001315))-1)

summary(slack_sow_N_effect_sp)
 Expected (In)efficiency    CI_lower            CI_upper      
 Min.   :-195.467        Min.   :-165.0667   Min.   :-288.07  
 1st Qu.:   0.000        1st Qu.:   0.0000   1st Qu.:-121.68  
 Median :   0.000        Median :   0.0000   Median :   0.00  
 Mean   :  -0.664        Mean   :  -0.0252   Mean   : -49.63  
 3rd Qu.:   0.000        3rd Qu.:   0.0000   3rd Qu.:   0.00  
 Max.   :   0.000        Max.   :   0.0000   Max.   :   0.00  
 NA's   :1089            NA's   :1089        NA's   :1089     

10 Ethiopia case study

We copy code from Silva () repository to showcase how the new method can be implement

Code
# package names
packages <- c("frontier", "dplyr", "tidyr", "knitr", "car", "RColorBrewer", "DT")

# install packages
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
    install.packages(packages[!installed_packages])
}

# load packages
invisible(lapply(packages, library, character.only = TRUE))

# read .csv file with data
file <- "https://raw.githubusercontent.com/jvasco323/EiA_YGD_workflow/main/data-wheat-ethiopia.csv"
data <- read.csv(url(file))

# list variables of interest
str(data)
'data.frame':   3783 obs. of  42 variables:
 $ zone                 : chr  "WEST SHOA" "WEST SHOA" "WEST SHOA" "WEST SHOA" ...
 $ zone_new             : chr  "W SHOA" "W SHOA" "W SHOA" "W SHOA" ...
 $ farming_system       : chr  "6. Highland mixed farming system" "6. Highland mixed farming system" "6. Highland mixed farming system" "6. Highland mixed farming system" ...
 $ aez                  : chr  "M2" "M2" "M2" "M2" ...
 $ year                 : int  2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 ...
 $ season_year          : chr  "Meher_2009" "Meher_2009" "Meher_2009" "Meher_2009" ...
 $ hhid                 : int  1 11 11 16 17 270 281 285 292 515 ...
 $ plotid               : int  5 2 2 5 2 2 6 6 4 2 ...
 $ subplotid            : int  1 4 3 1 3 1 1 1 1 1 ...
 $ subplotsize_ha       : num  0.5 0.5 0.0625 1 1.5 0.5 0.25 0.25 1.5 0.5 ...
 $ subplot_own          : chr  "Rented-in" "Owned" "Owned" "Rented-in" ...
 $ subplot_manager      : chr  "Man" "Man" "Man" "Man" ...
 $ plotdist_min         : num  10 5 2 15 10 5 15 30 5 5 ...
 $ crop                 : chr  "Wheat_br" "Wheat_br" "Wheat_br" "Wheat_br" ...
 $ gyga_cz              : int  5501 5501 5501 5501 5501 5501 5501 5501 5501 5501 ...
 $ gyga_gdd             : num  6539 6539 6539 6539 6539 ...
 $ gyga_tseas           : int  1208 1208 1208 1208 1208 1285 1200 1200 1200 1208 ...
 $ seed_kgha            : num  206 250 192 64 33.3 ...
 $ variety              : chr  "Landrace" "unknown" "unknown" "unknown" ...
 $ gyga_ai              : num  6544 6544 6544 6544 6544 ...
 $ gyga_av_water        : int  9 9 9 9 9 7 7 7 7 9 ...
 $ soil_depth           : chr  "Deep" "Deep" "Medium" "Medium" ...
 $ soil_slope           : chr  "Medium" "Steep" "Steep" "Flat" ...
 $ waterlogging_yn      : chr  "No" "No" "No" "No" ...
 $ drought_yn           : chr  "No" "No" "Yes" "No" ...
 $ soilwatercons_yn     : chr  "Yes" "Yes" "Yes" "No" ...
 $ oxplough_freq        : int  5 4 4 4 8 4 3 5 4 5 ...
 $ oxplough_freq_cat    : chr  ">Five" "Four" "Four" "Four" ...
 $ soil_fertility       : chr  "Poor" "Poor" "Poor" "Poor" ...
 $ nfert_kgha           : num  32.3 41.3 51.7 32.3 21.5 ...
 $ pfert_kgha           : num  10.04 20.07 16.06 10.04 6.69 ...
 $ manure_yn            : chr  "No" "No" "No" "No" ...
 $ residues_yn          : chr  "No" "Yes" "Yes" "Yes" ...
 $ previous_crop        : chr  "Cereal" "Cereal" "Cereal" "Cereal" ...
 $ herb_lha             : num  0.4 2 4 1 0.2 ...
 $ handweeding_persdayha: num  24 8 64 20 16 0 16 28 0 22 ...
 $ weeding_freq         : int  1 1 1 1 1 0 2 1 1 1 ...
 $ weeding_freq_cat     : chr  "One" "One" "One" "One" ...
 $ pesticide_yn         : chr  "No" "No" "No" "No" ...
 $ disease_incidence_yn : chr  "No" "Yes" "No" "Yes" ...
 $ pest_incidence_yn    : chr  "Yes" "No" "No" "No" ...
 $ yield_tha            : num  1.2 2.4 3.2 1 0.267 ...
Code
# create final data
data <- subset(data, yield_tha > 0)
data <- subset(data, residues_yn == "No" | residues_yn == "Yes")
data <- subset(data, soil_slope == "Flat" | soil_slope == "Medium" | soil_slope == "Steep")
data <- subset(data, zone_new != "")
data <- subset(data, oxplough_freq_cat == "<Two" |
    oxplough_freq_cat == "Three" |
    oxplough_freq_cat == "Four" |
    oxplough_freq_cat == ">Five")
data <- subset(data, weeding_freq_cat == "None" |
    weeding_freq_cat == "One" |
    weeding_freq_cat == "Two" |
    weeding_freq_cat == "Three+")

# fill NA values
data$seed_kgha[is.na(data$seed_kgha)] <- mean(data$seed_kgha, na.rm = T)
data$nfert_kgha[is.na(data$nfert_kgha)] <- 0
data$herb_lha[is.na(data$herb_lha)] <- 0
data$handweeding_persdayha[is.na(data$handweeding_persdayha)] <- 0

# reclassify categorical variables
data$variety <- ifelse(data$variety != "Landrace" & data$variety != "unknown", "Improved", data$variety)
data$variety <- ifelse(data$variety == "Landrace", "unknown", data$variety)
data$nfert_yn <- ifelse(data$nfert_kgha == 0, "N0", "N+")
data$weeding_yn <- ifelse(data$herb_lha == 0 & data$handweeding_persdayha == 0, "No", "Yes")

# copy df with transformed data
data_new <- data

# replace 0 with small value for log-transformation
data_new[data_new == 0] <- 0.0001

# log-transform continuous variables
vars1 <- c(
    "gyga_gdd", "gyga_tseas", "seed_kgha", "gyga_ai", "gyga_av_water", "nfert_kgha", "pfert_kgha",
    "herb_lha", "handweeding_persdayha", "yield_tha"
)
log_f <- function(x) {
    log(x)
}
data_new[, vars1] <- lapply(data_new[, vars1], log_f)

# set categorical variables to factor
vars2 <- c(
    "farming_system", "aez", "zone_new", "season_year", "variety", "soil_depth", "soil_fertility",
    "waterlogging_yn", "drought_yn", "soilwatercons_yn", "manure_yn", "residues_yn", "previous_crop",
    "oxplough_freq_cat", "weeding_yn", "pesticide_yn", "disease_incidence_yn", "pest_incidence_yn"
)
data_new[, vars2] <- lapply(data_new[, vars2], factor)

vars3 <- c("year", "handweeding_persdayha", "herb_lha", "pfert_kgha", "nfert_kgha", "seed_kgha", "subplotsize_ha", "yield_tha")

# mean
library(dplyr)
numeric_cols_mean <- data[, vars3] %>%
    group_by(year) %>%
    summarise(across(
        .cols = where(is.numeric),
        .fns = list(Mean = mean), na.rm = TRUE,
        .names = "{col}"
    ))
numeric_cols_mean <- round(numeric_cols_mean, 2)
numeric_cols_mean <- t(numeric_cols_mean)
colnames(numeric_cols_mean)[2] <- "Mean 2013"
colnames(numeric_cols_mean)[1] <- "Mean 2009"
numeric_cols_mean <- numeric_cols_mean[-1, ]
Variable <- rownames(numeric_cols_mean)
rownames(numeric_cols_mean) <- NULL
numeric_cols_mean <- cbind(Variable, numeric_cols_mean)
# sd
numeric_cols_sd <- data[, vars3] %>%
    group_by(year) %>%
    summarise(across(
        .cols = where(is.numeric),
        .fns = list(SD = sd), na.rm = TRUE,
        .names = "{col}"
    ))
numeric_cols_sd <- round(numeric_cols_sd, 2)
numeric_cols_sd <- t(numeric_cols_sd)
colnames(numeric_cols_sd)[2] <- "StDev 2013"
colnames(numeric_cols_sd)[1] <- "StDev 2009"
numeric_cols_sd <- numeric_cols_sd[-1, ]
Variable <- rownames(numeric_cols_sd)
rownames(numeric_cols_sd) <- NULL
numeric_cols_sd <- cbind(Variable, numeric_cols_sd)
# merge
numeric_cols <- merge(numeric_cols_mean, numeric_cols_sd, by = "Variable")
numeric_cols$Variable[1] <- "Hand-weeding (person-day/ha)"
numeric_cols$Variable[2] <- "Herbicide use (L/ha)"
numeric_cols$Variable[3] <- "N application rate (kg N/ha)"
numeric_cols$Variable[4] <- "P application rate (kg P/ha)"
numeric_cols$Variable[5] <- "Seed rate (kg/ha)"
numeric_cols$Variable[6] <- "Plot size (ha)"
numeric_cols$Variable[7] <- "Actual wheat yield (t/ha)"
# show
knitr::kable(numeric_cols)
Variable Mean 2009 Mean 2013 StDev 2009 StDev 2013
Hand-weeding (person-day/ha) 21.84 24.28 30.46 35.46
Herbicide use (L/ha) 0.5 0.59 0.83 0.88
N application rate (kg N/ha) 48.07 48.99 40.53 32.04
P application rate (kg P/ha) 19.63 20.34 13.93 12.36
Seed rate (kg/ha) 192.88 195.43 79.85 95.66
Plot size (ha) 0.45 0.4 0.39 0.3
Actual wheat yield (t/ha) 1.76 1.77 1.13 1.09
Code
# fit ols regression model
ols <-
    lm(
        yield_tha ~
            season_year + gyga_gdd + gyga_tseas + seed_kgha + variety +
            gyga_ai + gyga_av_water + soil_depth + soil_fertility + waterlogging_yn + drought_yn + soilwatercons_yn +
            nfert_kgha + manure_yn + residues_yn + previous_crop + oxplough_freq_cat +
            herb_lha + handweeding_persdayha + weeding_yn + pesticide_yn + disease_incidence_yn + pest_incidence_yn,
        data = data_new
    )

# check vif values
# vif(ols)

# see parameter estimates
# summary(ols)

# fit cobb-douglas stochastic frontier
sfa_cd <-
    sfa(
        yield_tha ~
            season_year + gyga_gdd + gyga_tseas + seed_kgha + variety +
            gyga_ai + gyga_av_water + soil_depth + soil_fertility + waterlogging_yn + drought_yn + soilwatercons_yn +
            nfert_kgha + manure_yn + residues_yn + previous_crop + oxplough_freq_cat +
            herb_lha + handweeding_persdayha + weeding_yn + pesticide_yn + disease_incidence_yn + pest_incidence_yn,
        data = data_new
    )

# add technical efficiency score to data frame
data_new$te_score_cd <- efficiencies(sfa_cd, asInData = T)

# see parameter estimates
summary(sfa_cd)
Error Components Frontier (see Battese & Coelli 1992)
Inefficiency decreases the endogenous variable (as in a production function)
The dependent variable is logged
Iterative ML estimation terminated after 33 iterations:
log likelihood values and parameters of two successive iterations
are within the tolerance limit

final maximum likelihood estimates
                          Estimate Std. Error  z value  Pr(>|z|)    
(Intercept)              9.7733888  1.1050740   8.8441 < 2.2e-16 ***
season_yearMeher_2013   -0.0539923  0.0264810  -2.0389  0.041460 *  
gyga_gdd                -0.5840446  0.0909876  -6.4189 1.372e-10 ***
gyga_tseas              -0.3252022  0.0525343  -6.1903 6.006e-10 ***
seed_kgha                0.0976523  0.0123652   7.8974 2.849e-15 ***
varietyunknown          -0.0018340  0.0231824  -0.0791  0.936944    
gyga_ai                 -0.3311850  0.0594151  -5.5741 2.488e-08 ***
gyga_av_water           -0.0080901  0.0412886  -0.1959  0.844658    
soil_depthMedium        -0.0742006  0.0232508  -3.1913  0.001416 ** 
soil_depthShallow       -0.0812014  0.0291497  -2.7857  0.005342 ** 
soil_fertilityMedium    -0.0568591  0.0203025  -2.8006  0.005101 ** 
soil_fertilityPoor      -0.1627704  0.0309154  -5.2650 1.402e-07 ***
waterlogging_ynYes      -0.3472614  0.0379737  -9.1448 < 2.2e-16 ***
drought_ynYes           -0.4474776  0.0456593  -9.8004 < 2.2e-16 ***
soilwatercons_ynYes      0.0585665  0.0275969   2.1222  0.033820 *  
nfert_kgha               0.2723653  0.0126521  21.5272 < 2.2e-16 ***
manure_ynYes             0.0373056  0.0261372   1.4273  0.153494    
residues_ynYes           0.0316401  0.0270959   1.1677  0.242925    
previous_cropLegume      0.0220438  0.0244585   0.9013  0.367442    
previous_cropOther       0.1230667  0.0264684   4.6496 3.326e-06 ***
oxplough_freq_cat>Five   0.0534308  0.0580652   0.9202  0.357475    
oxplough_freq_catFour   -0.0126670  0.0569799  -0.2223  0.824075    
oxplough_freq_catThree  -0.1015532  0.0573738  -1.7700  0.076722 .  
herb_lha                 0.0134322  0.0029268   4.5894 4.446e-06 ***
handweeding_persdayha   -0.0038850  0.0020551  -1.8905  0.058697 .  
weeding_ynYes            0.0369644  0.0496531   0.7445  0.456602    
pesticide_ynYes          0.1208168  0.0495391   2.4388  0.014735 *  
disease_incidence_ynYes -0.3161391  0.0314308 -10.0582 < 2.2e-16 ***
pest_incidence_ynYes    -0.0857481  0.0733460  -1.1691  0.242367    
sigmaSq                  0.5977253  0.0250709  23.8414 < 2.2e-16 ***
gamma                    0.7421739  0.0246107  30.1565 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
log likelihood value: -3066.353 

cross-sectional data
total number of observations = 3694 

mean efficiency: 0.633058 
Code
# estimate efficiency yield gap (%)
data_new["efficiency_yg"] <- 100 - (data_new["te_score_cd"] * 100)

# select relevant columns
data_new <- data_new[c(
    "zone_new", "season_year", "hhid", "plotid", "subplotid", "te_score_cd",
    "efficiency_yg"
)]

# merge the new columns to original data frame
data <- merge(data, data_new, by = c("zone_new", "season_year", "hhid", "subplotid"), all.x = T)

# estimate technical efficiency yield (t/ha)
data["ytex_tha"] <- data["yield_tha"] / data["te_score_cd"]

# create an empty data frame
data_final <- data.frame()

# create loop per year
for (yr in unique(data$year)) {
    subset_year <- subset(data, year == yr)

    # create loop per climate zone
    for (cz in unique(subset_year$gyga_cz)) {
        subset_cz <- subset(subset_year, gyga_cz == cz)

        # create loop per soil type
        for (soil in unique(subset_cz$soil_fertility)) {
            subset_soil <- subset(subset_cz, soil_fertility == soil)

            # create column with field class based on yield distribution
            subset_soil$field_class <- ifelse(subset_soil$yield_tha >= quantile(subset_soil$yield_tha, 0.90),
                "YHF", ""
            )
            subset_soil$field_class <- ifelse(subset_soil$yield_tha <= quantile(subset_soil$yield_tha, 0.10),
                "YLF", subset_soil$field_class
            )
            subset_soil$field_class <- ifelse(subset_soil$yield_tha > quantile(subset_soil$yield_tha, 0.10) &
                subset_soil$yield_tha < quantile(subset_soil$yield_tha, 0.90),
            "YAF", subset_soil$field_class
            )

            # subset highest yielding fields only
            yhf <- subset(subset_soil, field_class == "YHF")

            # add column with yhf in t/ha to data frame
            subset_soil["yhf_tha"] <- mean(yhf$yield_tha, na.rm = T)

            # bind all individual fields into single data frame
            data_final <- rbind(data_final, subset_soil)
        }
    }
    
}

# 

10.1 geoadditive efficiency model

10.1.1 Collect geocoodinates or district shapefiles z

Code
# Write formulae for parameters for the mean model, sigma model, and inefficiency model
# We use the original functional forms: identity for mean model, log sigma for sigma model and log ineff for the inefficiency

# mu_formula <- yield_tha ~
#             season_year + gyga_gdd + gyga_tseas + seed_kgha + variety +
#             gyga_ai + gyga_av_water + soil_depth + soil_fertility + waterlogging_yn + drought_yn + soilwatercons_yn +
#             nfert_kgha + manure_yn + residues_yn + previous_crop + oxplough_freq_cat +
#             herb_lha + handweeding_persdayha + weeding_yn + pesticide_yn + disease_incidence_yn + pest_incidence_yn

# sigma_v_formula <- ~ 1 

# sigma_u_formula <- ~ 1 + weeding_yn


#s <- -1 # production function specification

# Fit model
# If using older versions of R you can use the following code
# model <- mgcv::gam(
#     formula = list(mu_formula, sigma_v_formula, sigma_u_formula),
#     data = LDS_wheat_public_cleaned, family = normhnorm(s = s), optimizer = c("efs")
# )

# Eth_model_dsfa<-dsfa(formula=list(mu_formula, sigma_v_formula, sigma_u_formula),
#                  data=data_final,  family=comper(s=s, distr="normhnorm"), optimizer = c("efs"))
# summary(Eth_model_dsfa)

# plot(Eth_model_dsfa, select = 1) # Estimated function
# plot(Eth_model_dsfa, select = 2) 

11 Discussion: Yield gap analysis as a prioritization framework

We turn next to discuss how yield gap analysis using stochastic frontier analysis can be used as a prioritization framework. Similar ideas can be extended to account for bad outputs like GHG and multiple outputs.

12 Conclusion

13 References

Liu, Y., and Myers, R. 2009. “Model selection in stochastic frontier analysis with an application to maize production in Kenya”. Journal of Productivity Analysis 31: 33-46. Doi: https://doi.org/10.1007/s11123-008-0111-9.

Olsen, J.V., and Henningsen, A.2011. “Investment Utilisation, Adjustment Costs, and Technical Efficiency in Danish Pig Farms.” FOI Working Paper, No. 2011/13, University of Copenhagen, Department of Food and Resource Economics (IFRO),Copenhagen.

Pross, C., Strumann, C., Geissler, A., Herwatz, H., and Klein, N. 2018. “Quality and resource efficiency in hospital service provision: A geoadditive stochastic frontier analysis of stroke quality of care in Germany”. PLOS One. Doi:https://doi.org/10.1371/journal.pone.0203017.